Vignette for Modeling Interactions with Treatment in High Dimensions: Binary Outcome
The purpose of this vignette is to offer guidance for colleagues looking to discover biomarkers whose association with an outcome may differ by treatment arm. We focus in particular on the high-dimensional setting, though many of these ideas apply equally well in low dimensions.
Much of this material is also covered in a companion vignette for survival outcomes (https://github.com/apfela2/Vignettes/Vignette_grpLasso_Survival), with some differences. In particular, a different function is used to run the group lasso regression (and consequently different wrapper functions). Here we also demonstrate an example using standard dummy coding for treatment rather than +1/-1 coding, as discussed in both vignettes.
In this vignette we will cover parameterizing your model to focus on identifying biomarkers that interact with treatment, the importance of using splines when investigating interactions, and our recommended approach for modeling biomarkers that appear to have a point of discontinuity. We also address several challenges unique to the penalized regression setting, such as including splines and interactions, scaling continuous variables appropriately for proper inference when categorical variables are also present, and appropriate visualizations.
We used data from 9 clinical trials comprising 1,137 patients who were evaluable for Outcome and received one of 2 possible treatments (initial cohort of 1,803). Of the 1,137 patients in our analysis, 337 were positive for the binary outcome of interest ("Outcome"), leaving us with an effective sample size of 337. See the vignette on Regression Modeling Strategies (https://github.com/apfela2/Vignettes/Vignette_Regression_Strategies) for more information on calculating effective sample size and how best to make use of it.
The goal of our analysis was to discover if any biomarkers, from a pre-selected list of 18, differ in their association with “Outcome” by treatment.
The first step after loading our data is to scale all continuous variables by 2 times the standard deviation (rather than the more typical 1 SD). This puts continuous variables on a comparable scale with binary ones (Gelman, 2008).
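To see why dividing by 2 SDs helps, note that a balanced binary covariate has a standard deviation of about 0.5, so a continuous variable divided by 2 SDs ends up with the same spread and its coefficient becomes directly comparable. A minimal sketch with simulated, illustrative data:

```r
# Sketch: why scale by 2 SDs (Gelman, 2008) -- simulated, illustrative data
set.seed(1)
scale2 <- function(x) (x - mean(x, na.rm = TRUE)) / (2 * sd(x, na.rm = TRUE))

z <- rbinom(1000, 1, 0.5)   # balanced binary covariate, sd roughly 0.5
x <- rnorm(1000, 50, 10)    # continuous covariate on its raw scale

sd(z)          # about 0.5
sd(scale2(x))  # exactly 0.5 by construction
# A 1-unit change in scale2(x) now spans mean - 1 SD to mean + 1 SD,
# matching the "full" 0-to-1 contrast of the binary covariate
```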
Note that the full analysis, which is beyond the scope of this vignette, required developing several models, so we used a Config file to reduce redundant code. The code presented below makes use of one of these Config files, shown here.
library(bmsPURR)
# library(GGally)
library(corrplot)
# library(mice)
# library(lme4)
library(rms)
library(rmsb)
library(parallel)
# library(doParallel)
# library(glinternet)
# library(TunePareto)
# library(survivalROC)
# library(survcomp)
# library(timeROC)
library(grpreg)
# library(CalibrationCurves)
# library(splines)
# library(ellipse)
library(tidyverse)
# Config File:
yimp <- "No"
lab_yimp <- "NoY_"
simon <- "No"
lab_simon <- "NoSimon"
ver <- "v01"
priorI <- "No"
lab_priorI <- "NoPriorI_"
describe(imp_dat)
## imp_dat
##
## 20 Variables 1140 Observations
## --------------------------------------------------------------------------------
## Cont01
## n missing distinct Info Mean Gmd .05 .10
## 1138 2 766 1 4.841 2.433 2.266 2.650
## .25 .50 .75 .90 .95
## 3.307 4.226 5.598 7.300 9.307
##
## lowest : 1.178 1.22 1.596 1.652 1.7 , highest: 24.9 26.877 28.336 30.031 36.774
## --------------------------------------------------------------------------------
## Cont02
## n missing distinct Info Mean Gmd .05 .10
## 1140 0 67 0.999 58.44 15.31 33 39
## .25 .50 .75 .90 .95
## 49 60 68 74 79
##
## lowest : 18 23 24 25 26, highest: 85 86 87 89 90
## --------------------------------------------------------------------------------
## Cont03
## n missing distinct Info Mean Gmd .05 .10
## 1140 0 236 1 70.16 58.39 12.0 17.0
## .25 .50 .75 .90 .95
## 29.0 53.0 93.0 141.0 185.1
##
## lowest : 9 10 10.07 10.1 10.76, highest: 332 339 343 372 374
## --------------------------------------------------------------------------------
## Cont04
## n missing distinct Info Mean Gmd .05 .10
## 1113 27 289 1 27.78 5.876 20.40 21.90
## .25 .50 .75 .90 .95
## 24.00 27.10 30.70 34.88 37.31
##
## lowest : 12.6 17 17.1 17.2 17.3, highest: 49.5 51.1 53.4 54 56.5
## --------------------------------------------------------------------------------
## Cont05
## n missing distinct Info Mean Gmd .05 .10
## 943 197 255 1 119.1 140.5 19.0 24.2
## .25 .50 .75 .90 .95
## 36.0 57.0 99.0 201.6 350.1
##
## lowest : 1 2 4 5 6, highest: 2417 2480 2513 2681 2702
## --------------------------------------------------------------------------------
## Cont06
## n missing distinct Info Mean Gmd .05 .10
## 1138 2 89 1 135.2 18.73 103 113
## .25 .50 .75 .90 .95
## 125 137 147 155 160
##
## lowest : 77 78 79 81 87, highest: 169 170 171 175 180
## --------------------------------------------------------------------------------
## Cont07
## n missing distinct Info Mean Gmd .05 .10
## 1138 2 579 1 1.56 0.7054 0.7000 0.8527
## .25 .50 .75 .90 .95
## 1.1100 1.4595 1.8992 2.3600 2.7500
##
## lowest : 0.19 0.272 0.3 0.31 0.32 , highest: 4.439 4.77 4.896 5.98 7.3
## --------------------------------------------------------------------------------
## Cont08
## n missing distinct Info Mean Gmd .05 .10
## 1137 3 338 1 266.6 99.2 152.8 170.0
## .25 .50 .75 .90 .95
## 203.0 248.0 305.0 378.4 439.6
##
## lowest : 77 100 107 110 111, highest: 692 708 776 831 1056
## --------------------------------------------------------------------------------
## Cont09
## n missing distinct Info Mean Gmd .05 .10
## 958 182 31 0.941 9.953 15.81 0 0
## .25 .50 .75 .90 .95
## 0 1 6 30 60
##
## lowest : 0 1 2 3 4, highest: 75 80 90 95 100
## --------------------------------------------------------------------------------
## Cont10
## n missing distinct Info Mean Gmd .05 .10
## 1125 15 441 1 2.415 0.2706 2.137 2.176
## .25 .50 .75 .90 .95
## 2.233 2.338 2.538 2.755 2.907
##
## lowest : 1.88081 1.90309 1.99564 2.00432 2.01703
## highest: 3.43489 3.44871 3.50365 3.71332 3.76856
## --------------------------------------------------------------------------------
## Cat01
## n missing distinct
## 1137 3 2
##
## Value 0 >=1
## Frequency 847 290
## Proportion 0.745 0.255
## --------------------------------------------------------------------------------
## Cat02
## n missing distinct Info Sum Mean Gmd
## 1140 0 2 0.655 367 0.3219 0.437
##
## --------------------------------------------------------------------------------
## Cat03
## n missing distinct
## 1140 0 2
##
## Value NO YES
## Frequency 393 747
## Proportion 0.345 0.655
## --------------------------------------------------------------------------------
## Cat04
## n missing distinct
## 1066 74 4
##
## Value TypeA TypeB TypeC TypeD
## Frequency 55 157 215 639
## Proportion 0.052 0.147 0.202 0.599
## --------------------------------------------------------------------------------
## Cat05
## n missing distinct
## 1140 0 3
##
## Value TypeA TypeB TypeC
## Frequency 610 382 148
## Proportion 0.535 0.335 0.130
## --------------------------------------------------------------------------------
## Cat06
## n missing distinct Info Sum Mean Gmd
## 1140 0 2 0.182 74 0.06491 0.1215
##
## --------------------------------------------------------------------------------
## Cat07
## n missing distinct Info Sum Mean Gmd
## 1140 0 2 0.469 221 0.1939 0.3128
##
## --------------------------------------------------------------------------------
## Cat08
## n missing distinct
## 1058 82 2
##
## Value NO YES
## Frequency 690 368
## Proportion 0.652 0.348
## --------------------------------------------------------------------------------
## ARM2
## n missing distinct
## 1140 0 2
##
## Value PLACEBO TREATMENT
## Frequency 521 619
## Proportion 0.457 0.543
## --------------------------------------------------------------------------------
## Outcome
## n missing distinct Info Sum Mean Gmd
## 1140 0 2 0.627 339 0.2974 0.4182
##
## --------------------------------------------------------------------------------
cont_bm <- c("Cont01", "Cont02", "Cont03", "Cont04", "Cont05", "Cont06",
"Cont07", "Cont08", "Cont09", "Cont10")
cat_bm <- c("Cat01", "Cat02", "Cat03", "Cat04", "Cat05", "Cat06", "Cat07", "Cat08")
trt <- "ARM2"
# Capture mean and sd for purpose of external validation
means <- apply(imp_dat[, cont_bm], 2, mean, na.rm = T)
sds <- apply(imp_dat[, cont_bm], 2, sd, na.rm = T)
# Create new data frame with means and standard deviations
mean_sd <- data.frame(variable = cont_bm, mean = means, sd = sds)
scale2 <- function(x) {
(x - mean(x, na.rm = T))/(2*sd(x, na.rm = T))
}
# Fix up data
imp_dat <- mutate_at(imp_dat, cont_bm, list(s = ~scale2(.)))
cont_bm <- paste0(cont_bm, "_s")
We then use single imputation WITHOUT the outcome in the imputation model; this is a conservative approach. With multiple imputation we could include the outcome in the imputation model, but due to the complexities of combining multiple imputation with penalized regression we opt for the simpler, conservative approach of single imputation without the outcome.
See Regression Modeling Strategies vignette (https://github.com/apfela2/Vignettes/Vignette_Regression_Strategies) for more details about best imputation practices.
# Perform single imputation WITHOUT using outcome: conservative approach
# Explore missing patterns
na.patterns <- naclus(imp_dat)
naplot(na.patterns)
imp_dat$numvar_missing <- na.patterns$na.per.obs
# Remove subjects missing on >= 8 variables (3 subjects)
red_dat <- filter(imp_dat, numvar_missing <= 7)
na.patterns <- naclus(red_dat)
naplot(na.patterns)
if(yimp == "No"){
naplot(na.patterns, "na per var")
mtext("Single Imputation: No Y", side = 3)
} else if(yimp == "Yes"){
naplot(na.patterns, "na per var")
mtext("Single Imputation: With Y", side = 3)
}
# Explore correlation of continuous biomarkers
describe(red_dat)
## red_dat
##
## 31 Variables 1137 Observations
## --------------------------------------------------------------------------------
## Cont01
## n missing distinct Info Mean Gmd .05 .10
## 1137 0 766 1 4.84 2.434 2.265 2.650
## .25 .50 .75 .90 .95
## 3.306 4.224 5.592 7.300 9.310
##
## lowest : 1.178 1.22 1.596 1.652 1.7 , highest: 24.9 26.877 28.336 30.031 36.774
## --------------------------------------------------------------------------------
## Cont02
## n missing distinct Info Mean Gmd .05 .10
## 1137 0 67 0.999 58.44 15.29 33 39
## .25 .50 .75 .90 .95
## 49 60 68 74 79
##
## lowest : 18 23 24 25 26, highest: 85 86 87 89 90
## --------------------------------------------------------------------------------
## Cont03
## n missing distinct Info Mean Gmd .05 .10
## 1137 0 236 1 70.2 58.42 12.0 17.0
## .25 .50 .75 .90 .95
## 29.0 53.0 93.0 141.0 185.6
##
## lowest : 9 10 10.07 10.1 10.76, highest: 332 339 343 372 374
## --------------------------------------------------------------------------------
## Cont04
## n missing distinct Info Mean Gmd .05 .10
## 1111 26 288 1 27.8 5.87 20.40 21.90
## .25 .50 .75 .90 .95
## 24.05 27.10 30.70 34.90 37.31
##
## lowest : 12.6 17 17.1 17.2 17.3, highest: 49.5 51.1 53.4 54 56.5
## --------------------------------------------------------------------------------
## Cont05
## n missing distinct Info Mean Gmd .05 .10
## 943 194 255 1 119.1 140.5 19.0 24.2
## .25 .50 .75 .90 .95
## 36.0 57.0 99.0 201.6 350.1
##
## lowest : 1 2 4 5 6, highest: 2417 2480 2513 2681 2702
## --------------------------------------------------------------------------------
## Cont06
## n missing distinct Info Mean Gmd .05 .10
## 1137 0 89 1 135.2 18.73 103 113
## .25 .50 .75 .90 .95
## 125 137 147 155 160
##
## lowest : 77 78 79 81 87, highest: 169 170 171 175 180
## --------------------------------------------------------------------------------
## Cont07
## n missing distinct Info Mean Gmd .05 .10
## 1137 0 579 1 1.561 0.7052 0.7000 0.8554
## .25 .50 .75 .90 .95
## 1.1100 1.4600 1.9000 2.3600 2.7500
##
## lowest : 0.19 0.272 0.3 0.31 0.32 , highest: 4.439 4.77 4.896 5.98 7.3
## --------------------------------------------------------------------------------
## Cont08
## n missing distinct Info Mean Gmd .05 .10
## 1136 1 338 1 266.6 99.26 152.8 170.0
## .25 .50 .75 .90 .95
## 203.0 248.0 305.0 378.5 439.8
##
## lowest : 77 100 107 110 111, highest: 692 708 776 831 1056
## --------------------------------------------------------------------------------
## Cont09
## n missing distinct Info Mean Gmd .05 .10
## 958 179 31 0.941 9.953 15.81 0 0
## .25 .50 .75 .90 .95
## 0 1 6 30 60
##
## lowest : 0 1 2 3 4, highest: 75 80 90 95 100
## --------------------------------------------------------------------------------
## Cont10
## n missing distinct Info Mean Gmd .05 .10
## 1124 13 440 1 2.415 0.2707 2.137 2.176
## .25 .50 .75 .90 .95
## 2.233 2.338 2.538 2.755 2.907
##
## lowest : 1.88081 1.90309 1.99564 2.00432 2.01703
## highest: 3.43489 3.44871 3.50365 3.71332 3.76856
## --------------------------------------------------------------------------------
## Cat01
## n missing distinct
## 1134 3 2
##
## Value 0 >=1
## Frequency 845 289
## Proportion 0.745 0.255
## --------------------------------------------------------------------------------
## Cat02
## n missing distinct Info Sum Mean Gmd
## 1137 0 2 0.656 367 0.3228 0.4376
##
## --------------------------------------------------------------------------------
## Cat03
## n missing distinct
## 1137 0 2
##
## Value NO YES
## Frequency 391 746
## Proportion 0.344 0.656
## --------------------------------------------------------------------------------
## Cat04
## n missing distinct
## 1065 72 4
##
## Value TypeA TypeB TypeC TypeD
## Frequency 55 156 215 639
## Proportion 0.052 0.146 0.202 0.600
## --------------------------------------------------------------------------------
## Cat05
## n missing distinct
## 1137 0 3
##
## Value TypeA TypeB TypeC
## Frequency 609 380 148
## Proportion 0.536 0.334 0.130
## --------------------------------------------------------------------------------
## Cat06
## n missing distinct Info Sum Mean Gmd
## 1137 0 2 0.18 73 0.0642 0.1203
##
## --------------------------------------------------------------------------------
## Cat07
## n missing distinct Info Sum Mean Gmd
## 1137 0 2 0.47 221 0.1944 0.3135
##
## --------------------------------------------------------------------------------
## Cat08
## n missing distinct
## 1056 81 2
##
## Value NO YES
## Frequency 689 367
## Proportion 0.652 0.348
## --------------------------------------------------------------------------------
## ARM2
## n missing distinct
## 1137 0 2
##
## Value PLACEBO TREATMENT
## Frequency 518 619
## Proportion 0.456 0.544
## --------------------------------------------------------------------------------
## Outcome
## n missing distinct Info Sum Mean Gmd
## 1137 0 2 0.626 337 0.2964 0.4175
##
## --------------------------------------------------------------------------------
## Cont01_s
## n missing distinct Info Mean Gmd .05
## 1137 0 766 1 -0.0001899 0.4276 -0.4528
## .10 .25 .50 .75 .90 .95
## -0.3851 -0.2698 -0.1085 0.1319 0.4321 0.7852
##
## lowest : -0.643694 -0.636314 -0.570244 -0.560404 -0.551969
## highest: 3.5247 3.87209 4.12846 4.42631 5.61117
## --------------------------------------------------------------------------------
## Cont02_s
## n missing distinct Info Mean Gmd .05
## 1137 0 67 0.999 -0.0001845 0.5654 -0.94088
## .10 .25 .50 .75 .90 .95
## -0.71899 -0.34918 0.05761 0.35346 0.57535 0.76026
##
## lowest : -1.4956 -1.31069 -1.27371 -1.23673 -1.19975
## highest: 0.982145 1.01913 1.05611 1.13007 1.16705
## --------------------------------------------------------------------------------
## Cont03_s
## n missing distinct Info Mean Gmd .05 .10
## 1137 0 236 1 0.0003173 0.508 -0.5058 -0.4623
## .25 .50 .75 .90 .95
## -0.3579 -0.1492 0.1986 0.6160 1.0038
##
## lowest : -0.531853 -0.523157 -0.522548 -0.522287 -0.516548
## highest: 2.27691 2.33778 2.37256 2.62474 2.64213
## --------------------------------------------------------------------------------
## Cont04_s
## n missing distinct Info Mean Gmd .05 .10
## 1111 26 288 1 0.001219 0.5419 -0.68158 -0.54311
## .25 .50 .75 .90 .95
## -0.34463 -0.06307 0.26927 0.65699 0.87974
##
## lowest : -1.40165 -0.995458 -0.986226 -0.976995 -0.967763
## highest: 2.0048 2.15251 2.36483 2.42022 2.65101
## --------------------------------------------------------------------------------
## Cont05_s
## n missing distinct Info Mean Gmd .05 .10
## 943 194 255 1 5.741e-18 0.2692 -0.19182 -0.18185
## .25 .50 .75 .90 .95
## -0.15924 -0.11900 -0.03853 0.15806 0.44260
##
## lowest : -0.226305 -0.224389 -0.220557 -0.218641 -0.216725
## highest: 4.40296 4.52367 4.5869 4.90881 4.94904
## --------------------------------------------------------------------------------
## Cont06_s
## n missing distinct Info Mean Gmd .05 .10
## 1137 0 89 1 0.0003209 0.5584 -0.96101 -0.66292
## .25 .50 .75 .90 .95
## -0.30521 0.05249 0.35058 0.58905 0.73810
##
## lowest : -1.73604 -1.70623 -1.67642 -1.6168 -1.43795
## highest: 1.00638 1.03618 1.06599 1.18523 1.33427
## --------------------------------------------------------------------------------
## Cont07_s
## n missing distinct Info Mean Gmd .05 .10
## 1137 0 579 1 0.000541 0.5224 -0.63738 -0.52226
## .25 .50 .75 .90 .95
## -0.33367 -0.07441 0.25152 0.59227 0.88116
##
## lowest : -1.01516 -0.954415 -0.933675 -0.926267 -0.91886
## highest: 2.13228 2.37747 2.4708 3.27377 4.25156
## --------------------------------------------------------------------------------
## Cont08_s
## n missing distinct Info Mean Gmd .05 .10
## 1136 1 338 1 8.856e-05 0.5104 -0.58526 -0.49655
## .25 .50 .75 .90 .95
## -0.32686 -0.09546 0.19765 0.57560 0.89056
##
## lowest : -0.974778 -0.856507 -0.820512 -0.805085 -0.799943
## highest: 2.18769 2.26996 2.61963 2.90245 4.05945
## --------------------------------------------------------------------------------
## Cont09_s
## n missing distinct Info Mean Gmd .05
## 958 179 31 0.941 -2.519e-17 0.3907 -0.24597
## .10 .25 .50 .75 .90 .95
## -0.24597 -0.24597 -0.22125 -0.09769 0.49541 1.23679
##
## lowest : -0.245966 -0.221253 -0.196541 -0.171828 -0.147115
## highest: 1.60749 1.73105 1.97818 2.10174 2.2253
## --------------------------------------------------------------------------------
## Cont10_s
## n missing distinct Info Mean Gmd .05
## 1124 13 440 1 -6.324e-05 0.5206 -0.5349
## .10 .25 .50 .75 .90 .95
## -0.4592 -0.3497 -0.1469 0.2372 0.6545 0.9463
##
## lowest : -1.02711 -0.984269 -0.806269 -0.789562 -0.765112
## highest: 1.96196 1.98854 2.09422 2.49749 2.60374
## --------------------------------------------------------------------------------
## numvar_missing
## n missing distinct Info Mean Gmd
## 1137 0 8 0.721 0.8637 1.257
##
## Value 0 1 2 3 4 5 6 7
## Frequency 732 51 259 24 37 12 21 1
## Proportion 0.644 0.045 0.228 0.021 0.033 0.011 0.018 0.001
##
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
num_dat <- as.matrix(dplyr::select(red_dat, all_of(cont_bm)))
cor_mat <- rcorr(num_dat)
corrplot.mixed(cor_mat$r, lower = "ellipse", upper = "number", tl.pos = "lt")
red_dat <- dplyr::select(red_dat, -numvar_missing)
set.seed(122)
if (yimp == "No"){
MI <- aregImpute( ~ Cont09_s + Cat08 + Cont04_s + Cat04 + Cont10_s + Cont02_s + Cat05 + Cat06 +
Cont03_s + Cat01 + Cont06_s + Cont08_s + Cont07_s + Cont01_s + Cat02 + ARM2 + Cat03 + Cont05_s +
Cat07,
data = red_dat, n.impute = 1, tlinear = F)
} else if (yimp == "Yes"){
MI <- aregImpute( ~ Outcome + Cont09_s + Cat08 + Cont04_s + Cat04 + Cont10_s + Cont02_s + Cat05 + Cat06 +
Cont03_s + Cat01 + Cont06_s + Cont08_s + Cont07_s + Cont01_s + Cat02 + ARM2 + Cat03 + Cont05_s +
Cat07,
data = red_dat, n.impute = 1, tlinear = F)
}
## Iteration 1
## Iteration 2
## Iteration 3
## Iteration 4
# Create dataset with imputed values
imp <- as.data.frame(impute.transcan(MI, imputation=1, data=red_dat, list.out=TRUE, pr=FALSE, check=FALSE))
# Merge imputed values back into the analysis dataset
imp_dat <- red_dat
imp_dat[, names(imp)] <- bind_rows(imp)
imp_dat <- imp_dat[, c(cont_bm, cat_bm, trt, "Outcome")]
# Redundancy Analysis
red <- redun(formula = ~ Cont09_s + Cat08 + Cont04_s + Cat04 + Cont10_s + Cont02_s + Cat05 + Cat06 +
Cont03_s + Cat01 + Cont06_s + Cont08_s + Cont07_s + Cont01_s + Cat02 + ARM2 + Cat03 + Cont05_s +
Cat07,
r2 = 0.6, type = "adjusted", data = imp_dat)
print(red)
##
## Redundancy Analysis
##
## ~Cont09_s + Cat08 + Cont04_s + Cat04 + Cont10_s + Cont02_s +
## Cat05 + Cat06 + Cont03_s + Cat01 + Cont06_s + Cont08_s +
## Cont07_s + Cont01_s + Cat02 + ARM2 + Cat03 + Cont05_s + Cat07
##
## n: 1137 p: 19 nk: 3
##
## Number of NAs: 0
##
## Transformation of target variables forced to be linear
##
## R-squared cutoff: 0.6 Type: adjusted
##
## R^2 with which each variable can be predicted from all other variables:
##
## Cont09_s Cat08 Cont04_s Cat04 Cont10_s Cont02_s Cat05 Cat06
## 0.006 0.135 0.080 0.251 0.220 0.179 0.126 0.025
## Cont03_s Cat01 Cont06_s Cont08_s Cont07_s Cont01_s Cat02 ARM2
## 0.327 0.174 0.358 0.336 0.125 0.197 0.207 0.092
## Cat03 Cont05_s Cat07
## 0.195 0.024 0.070
##
## No redundant variables
Now that we have a complete dataset (no more missing values), we must make several transformations to run the most appropriate model.
While we generally recommend keeping continuous variables in continuous form rather than dichotomizing them, to avoid throwing away information, for Cont09 specifically experience tells us that there appears to be a point of discontinuity at 0. Therefore, as best practice, we recommend creating an indicator variable for Cont09 = 0 and modeling all values greater than 0 as continuous.
Note that this recommendation is independent of the complications of penalized regression and cubic splines we are applying in this analysis.
# All biomarkers with Indicator functions
Ind_bm <- c("Cont09_s")
# Create indicator for Cont09 = 0 (after centering and scaling, the raw value 0
# corresponds to the minimum of Cont09_s)
imp_dat <- mutate(imp_dat, Ind_Cont09_s_Ind = ifelse(Cont09_s == min(Cont09_s), 1, 0))
# All continuous BM excluding those with Indicator functions
cont_bm <- cont_bm[!(cont_bm %in% Ind_bm)]
In order to perform penalized regression with splines, we must transform the continuous variables into basis functions. Here we apply restricted cubic splines with 3 knots, so each continuous variable is transformed into 2 columns of data, each representing a different basis function of the spline.
To aid interpretability, we group the basis functions together so that they undergo the same amount of shrinkage; this way we do not have to worry about one basis function of a given variable being removed from the model while the other remains. We implement this via the group lasso penalty.
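As a quick check of the claim above (a sketch with simulated data), `rcs()` with k knots returns k - 1 basis columns:

```r
library(rms)

set.seed(1)
x <- rnorm(100)
# A 3-knot restricted cubic spline: linear term plus one nonlinear term
ncol(rcs(x, 3))  # 2 basis columns
```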
NOTE: If you plan on performing External Validation on a different cohort, it is imperative to use the same knot locations to make up the basis functions in the new data as was used in the training data. Thus, we demonstrate in this example by saving our data object after creating the basis functions.
# Create dataframe of basis functions for splines of continuous variables
cont_dat <- list()
for (i in seq_along(cont_bm)) {
  basis <- rcs(imp_dat[[cont_bm[i]]], 3)   # 3-knot spline -> 2 basis columns
  cont_dat[[paste0("mod_", cont_bm[i], "1")]] <- basis[, 1]
  cont_dat[[paste0("mod_", cont_bm[i], "2")]] <- basis[, 2]
}
for (i in seq_along(Ind_bm)) {
  basis <- rcs(imp_dat[[Ind_bm[i]]], 3)
  cont_dat[[paste0("Ind_", Ind_bm[i], "1")]] <- basis[, 1]
  cont_dat[[paste0("Ind_", Ind_bm[i], "2")]] <- basis[, 2]
}
cont_dat <- as.data.frame(cont_dat)
# Save knot locations for External Validation
if(!file.exists("/stash/results/dev/apfela/P02809_Melanoma_CompositeBM_Meta-analysis/NoY_NoPriorI_NoSimoncont_dat.rds")) {
saveRDS(cont_dat, file.path(results, paste0(lab_yimp, lab_priorI, lab_simon, "cont_dat.rds")))
}
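Beyond saving the full basis data frame, the knot locations themselves can be recovered from an `rcs` basis and reused on an external cohort. A hedged sketch (`new_x`, the scaled biomarker from the new cohort, is hypothetical; `rcs()` accepts either a knot count or a vector of knot locations as its second argument):

```r
# Sketch: reuse training knots on external data
# rcs() stores its knot locations in the "parms" attribute of the basis
basis_train <- rcs(imp_dat[["Cont01_s"]], 3)
knots <- attr(basis_train, "parms")

# On the external cohort, pass the saved knot locations instead of a knot count
# so the basis functions are identical to those used in training
basis_new <- rcs(new_x, knots)   # new_x: scaled Cont01 from the new cohort (hypothetical)
```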
Because the goal of this analysis is to identify biomarkers likely to have a different association with Outcome depending on treatment, it is sometimes beneficial to parameterize the treatment variable as +1/-1 instead of standard dummy coding (Tian et al., 2014). In this particular analysis we tried both parameterizations and present the model with standard dummy coding. See the corresponding vignette with survival outcomes for an example with the +1/-1 coding.
# Reparameterize treatment to +1/-1 to apply Tibshirani/Simon method
if (simon == "Yes") {
imp_dat$TRT2 <- with(imp_dat, ifelse(ARM2 == "TREATMENT", 1,
ifelse(ARM2 == "PLACEBO", -1, NA)))
} else if (simon == "No"){
imp_dat$TRT2 <- with(imp_dat, ifelse(ARM2 == "TREATMENT", 1,
ifelse(ARM2 == "PLACEBO", 0, NA)))
}
The software for group lasso (from the grpreg package) requires all predictors in matrix form, with categorical variables coded as 0/1. Thus we make the necessary transformations below.
Note that for categorical variables with more than 2 levels, we must create k - 1 dummy variables (3 for the 4-level Cat04 and 2 for the 3-level Cat05) and then group them together, as we do with the basis functions for continuous variables and the indicator variables for Cont09.
# Set up clinical covariates for model
imp_dat <- mutate(imp_dat, cat_Cat01 = as.numeric(as.factor(Cat01)) - 1)
imp_dat <- mutate(imp_dat, cat_Cat03 = as.numeric(as.factor(Cat03)) - 1)
imp_dat <- mutate(imp_dat, Cat04_TypeB = ifelse(Cat04 == "TypeB", 1, 0))
imp_dat <- mutate(imp_dat, Cat04_TypeC = ifelse(Cat04 == "TypeC", 1, 0))
imp_dat <- mutate(imp_dat, Cat04_TypeD = ifelse(Cat04 == "TypeD", 1, 0))
imp_dat <- mutate(imp_dat, Cat05_TypeB = ifelse(Cat05 == "TypeB", 1, 0))
imp_dat <- mutate(imp_dat, Cat05_TypeC = ifelse(Cat05 == "TypeC", 1, 0))
imp_dat <- mutate(imp_dat, cat_Cat08 = as.numeric(as.factor(Cat08)) - 1)
imp_dat <- mutate(imp_dat, cat_Cat07 = Cat07)
imp_dat <- mutate(imp_dat, cat_Cat06 = Cat06)
imp_dat <- mutate(imp_dat, cat_Cat02 = Cat02)
mod_Cat04 <- c("Cat04_TypeB", "Cat04_TypeC", "Cat04_TypeD")
mod_Cat05 <- c("Cat05_TypeB", "Cat05_TypeC")
mod_clin <- c("cat_Cat01", "cat_Cat03", "cat_Cat08", "cat_Cat07", "cat_Cat06", "cat_Cat02")
mod_trt <- "TRT2"
# Dataframe to be used for model: Keep only model-ready variables
mod_dat <- dplyr::select(imp_dat, all_of(mod_clin), all_of(mod_Cat04), all_of(mod_Cat05),
                         all_of(mod_trt), "Outcome", "Ind_Cont09_s_Ind")
mod_dat <- cbind(mod_dat, cont_dat)
# Remove extra levels from factors in dataset
mod_dat <- droplevels(mod_dat)
Next we must manually create the interactions between each of our predictors and treatment. Note that all predictors, with the exception of 2-level categorical predictors, have multiple columns associated with them; we need to multiply every column by the treatment variable to form the interaction. We will later group these interaction terms with each respective predictor for the group lasso penalty.
# Set up data to apply Noah Simon's group lasso method: Manually make interactions
# Continuous variables
bm <- str_subset(colnames(mod_dat), "mod_")
mod_dat <- mutate_at(mod_dat, bm, list(modINT = ~. * TRT2))
INT_bm <- str_subset(colnames(mod_dat), "_modINT")
# MultiLevel categorical variables
Cat04 <- str_subset(colnames(mod_dat), "Cat04_")
mod_dat <- mutate_at(mod_dat, Cat04, list(Cat04INT = ~. * TRT2))
INT_Cat04 <- str_subset(colnames(mod_dat), "_Cat04INT")
Cat05 <- str_subset(colnames(mod_dat), "Cat05_")
mod_dat <- mutate_at(mod_dat, Cat05, list(Cat05INT = ~. * TRT2))
INT_Cat05 <- str_subset(colnames(mod_dat), "_Cat05INT")
# Binary Categorical variables
clin <- str_subset(colnames(mod_dat), "cat_")
mod_dat <- mutate_at(mod_dat, clin, list(catINT = ~. * TRT2))
INT_cat <- str_subset(colnames(mod_dat), "_catINT")
# Indicator function variables
Ind <- c("Ind_Cont09_s_Ind", "Ind_Cont09_s1", "Ind_Cont09_s2")
mod_dat <- mutate_at(mod_dat, Ind, list(IndINT = ~. * TRT2))
INT_Ind <- c("Ind_Cont09_s_Ind_IndINT", "Ind_Cont09_s1_IndINT", "Ind_Cont09_s2_IndINT")
# List of all variables to be included in model
mod_var <- c(mod_trt, clin, "Cat04", "Cat05", cont_bm, Ind_bm)
Now that the data is finally in shape, we are almost ready to run our model. Penalized regression typically requires tuning parameter selection via cross-validation (CV); we recommend running the CV many times so as not to rely on one arbitrary split of the data, and then selecting the tuning parameter with the minimum median error across the repeated runs.
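The idea behind the repeated CV can be sketched directly with `cv.grpreg` (a simplified sketch; `X`, `y`, and `group` are assumed to be the design matrix, outcome vector, and group vector, and we fix the lambda sequence so errors are comparable across repeats):

```r
library(grpreg)

# Fit once without CV to obtain a common lambda sequence for all repeats
fit0 <- grpreg(X, y, group = group, family = "binomial", penalty = "grLasso")
lam  <- fit0$lambda

# Repeat 10-fold CV with different random splits
nreps   <- 25
cve_mat <- sapply(seq_len(nreps), function(r) {
  set.seed(r)
  cv.grpreg(X, y, group = group, family = "binomial", penalty = "grLasso",
            lambda = lam, nfolds = 10)$cve
})

# Select the lambda with the minimum median CV error across repeats
best_lambda <- lam[which.min(apply(cve_mat, 1, median))]
```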
To implement this recommendation, we make use of a wrapper function, rep_grpreg, which can be found at https://github.com/apfela2/Utility_Functions/blob/main/code/R/grpLasso/rep_grpreg.R
The rep_grpreg function has the following arguments:
| Parameter | Details |
|---|---|
| data | dataframe |
| x | Vector of predictor names to be placed in the model - IMPORTANT!!! Pay attention to order of predictors so that groups are assigned correctly |
| y | Name of outcome variable |
| group | Vector of group number assigned to each predictor |
| family | Distribution of outcome. Currently only set up for “binomial” |
| penalty | See grpreg documentation |
| lambda_seq | Vector of proposed lambda values, recommend to keep NULL |
| nfold | Number of folds used in inner loop of CV used for tuning parameter selection |
| nreps | Number of times to repeat the CV procedure |
| alpha | Tuning parameter for elastic net type penalty (how much weight to put on L1 as opposed to L2) |
| ncores_rep | Number of cores to use if parallelizing |
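For the `group` argument, every column belonging to one multi-column predictor shares a single group number, while treatment itself is left unpenalized (grpreg uses group 0 for unpenalized terms). In this sketch each interaction block gets its own group, separate from its main effect; that choice, and all names, are illustrative assumptions for a toy design matrix:

```r
# Hypothetical sketch: group vector for a small design matrix whose columns are
#   TRT2 | cat_A | cat_A:TRT2 | spline_B1 | spline_B2 | spline_B1:TRT2 | spline_B2:TRT2
group <- c(0,      # TRT2: left unpenalized (group 0 in grpreg)
           1,      # cat_A: its own group
           2,      # cat_A x treatment interaction
           3, 3,   # both spline basis columns of B share one group
           4, 4)   # both interaction columns of B share one group
# The order of `group` must match the column order of the design matrix exactly,
# which is why the sanity check below selects columns in a fixed order
```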
# Sanity check to make sure groups are properly assigned
tmp <- dplyr::select(mod_dat, c("TRT2", clin, INT_cat, Cat04, INT_Cat04, Cat05, INT_Cat05, bm, INT_bm, Ind, INT_Ind))
grp <- c(0, rep(1:length(clin), times = 2), rep(length(clin) + 1, times = length(Cat04)*2),
rep(length(clin) + 2, times = length(Cat05)*2),
rep(seq(length(clin) + 3, length(clin) + 2 + length(cont_bm)), each = 2, times = 2),
rep(length(clin) + 2 + length(cont_bm) + 1, each = 3, times = 2))
cbind(colnames(tmp), grp)
## grp
## [1,] "TRT2" "0"
## [2,] "cat_Cat01" "1"
## [3,] "cat_Cat03" "2"
## [4,] "cat_Cat08" "3"
## [5,] "cat_Cat07" "4"
## [6,] "cat_Cat06" "5"
## [7,] "cat_Cat02" "6"
## [8,] "cat_Cat01_catINT" "1"
## [9,] "cat_Cat03_catINT" "2"
## [10,] "cat_Cat08_catINT" "3"
## [11,] "cat_Cat07_catINT" "4"
## [12,] "cat_Cat06_catINT" "5"
## [13,] "cat_Cat02_catINT" "6"
## [14,] "Cat04_TypeB" "7"
## [15,] "Cat04_TypeC" "7"
## [16,] "Cat04_TypeD" "7"
## [17,] "Cat04_TypeB_Cat04INT" "7"
## [18,] "Cat04_TypeC_Cat04INT" "7"
## [19,] "Cat04_TypeD_Cat04INT" "7"
## [20,] "Cat05_TypeB" "8"
## [21,] "Cat05_TypeC" "8"
## [22,] "Cat05_TypeB_Cat05INT" "8"
## [23,] "Cat05_TypeC_Cat05INT" "8"
## [24,] "mod_Cont01_s1" "9"
## [25,] "mod_Cont01_s2" "9"
## [26,] "mod_Cont02_s1" "10"
## [27,] "mod_Cont02_s2" "10"
## [28,] "mod_Cont03_s1" "11"
## [29,] "mod_Cont03_s2" "11"
## [30,] "mod_Cont04_s1" "12"
## [31,] "mod_Cont04_s2" "12"
## [32,] "mod_Cont05_s1" "13"
## [33,] "mod_Cont05_s2" "13"
## [34,] "mod_Cont06_s1" "14"
## [35,] "mod_Cont06_s2" "14"
## [36,] "mod_Cont07_s1" "15"
## [37,] "mod_Cont07_s2" "15"
## [38,] "mod_Cont08_s1" "16"
## [39,] "mod_Cont08_s2" "16"
## [40,] "mod_Cont10_s1" "17"
## [41,] "mod_Cont10_s2" "17"
## [42,] "mod_Cont01_s1_modINT" "9"
## [43,] "mod_Cont01_s2_modINT" "9"
## [44,] "mod_Cont02_s1_modINT" "10"
## [45,] "mod_Cont02_s2_modINT" "10"
## [46,] "mod_Cont03_s1_modINT" "11"
## [47,] "mod_Cont03_s2_modINT" "11"
## [48,] "mod_Cont04_s1_modINT" "12"
## [49,] "mod_Cont04_s2_modINT" "12"
## [50,] "mod_Cont05_s1_modINT" "13"
## [51,] "mod_Cont05_s2_modINT" "13"
## [52,] "mod_Cont06_s1_modINT" "14"
## [53,] "mod_Cont06_s2_modINT" "14"
## [54,] "mod_Cont07_s1_modINT" "15"
## [55,] "mod_Cont07_s2_modINT" "15"
## [56,] "mod_Cont08_s1_modINT" "16"
## [57,] "mod_Cont08_s2_modINT" "16"
## [58,] "mod_Cont10_s1_modINT" "17"
## [59,] "mod_Cont10_s2_modINT" "17"
## [60,] "Ind_Cont09_s_Ind" "18"
## [61,] "Ind_Cont09_s1" "18"
## [62,] "Ind_Cont09_s2" "18"
## [63,] "Ind_Cont09_s_Ind_IndINT" "18"
## [64,] "Ind_Cont09_s1_IndINT" "18"
## [65,] "Ind_Cont09_s2_IndINT" "18"
PARAMS <- list(dat = mod_dat,
x = colnames(tmp),
y = "Outcome",
group = grp,
penalty = "grLasso",
lambda_seq = NULL,
nreps = 50, #inner loop of cv to fit lambda
nfold = 5, #inner loop of cv to fit lambda
# nlambda = 100, #number of lambdas to evaluate
# lambda.type = "min", #which lambda to select from based on inner loop cv to apply to full model within each outer loop (not full-full model if there's outer loop present)
# num_outer_rep = 100, #number of repeats for outer loop for performance estimation
# outer_cv_nfolds = 5, #number of folds for outer loop of cross-validation
font_size = 18,
# run.parallel = T,
ncores_rep = max(1, (detectCores() - 1)),
# response = surv.resp,
# cont_X = str_subset(colnames(dat), "_Q"),
alpha = 0.95,
verbose = T)
set.seed(46363)
if(!file.exists(file.path("/stash/results/dev/apfela/P02809_Melanoma_CompositeBM_Meta-analysis",
paste0(lab_yimp, lab_priorI, lab_simon, "full_grpreg_anonymized.rds")))) {
full_grpreg <- rep_grpreg(data = mod_dat, x = PARAMS$x, y = PARAMS$y, group = PARAMS$group,
penalty = PARAMS$penalty, lambda_seq = PARAMS$lambda_seq,
nreps = PARAMS$nreps, nfold = PARAMS$nfold, alpha = PARAMS$alpha,
ncores_rep = PARAMS$ncores_rep, family = "binomial")
saveRDS(full_grpreg, file = file.path(results, paste0(lab_yimp, lab_priorI, lab_simon, "full_grpreg_anonymized.rds")))
} else {
full_grpreg <- readRDS(file = file.path("/stash/results/dev/apfela/P02809_Melanoma_CompositeBM_Meta-analysis",
paste0(lab_yimp, lab_priorI, lab_simon, "full_grpreg_anonymized.rds")))
}
## Finished
The first useful visualization is to look at the distribution of CV Error across the range of lambda values.
You would like to see a figure in which the error decreases as lambda increases up to a certain inflection point, after which the error rises again. That inflection point marks the "best lambda".
What you do NOT want to see is the CV error curve plateau as lambda increases. That would tell us that the model with the highest value of lambda (i.e., the null model) is the best performing model, implying the chosen model provides no meaningful information.
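This failure mode can also be flagged programmatically. Below is a minimal sketch; `null_model_selected` is a hypothetical helper, assuming the `best_lambda` and `lambda_seq` elements returned by rep_grpreg.

```r
# Sketch: flag the case where CV selects the largest lambda, i.e. the
# null model, meaning the covariates add no predictive information.
null_model_selected <- function(best_lambda, lambda_seq) {
  isTRUE(best_lambda >= max(lambda_seq))
}
# e.g. null_model_selected(full_grpreg$best_lambda, full_grpreg$lambda_seq)
```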
# Summarize results from full grpreg model
full_PR <- full_grpreg$full_model
# Set up data for CV Error vs lambda plot
err_PR <- as.data.frame(full_grpreg$cv_errors)
sum_err <- data.frame(err.median = apply(err_PR, 2, median), err.sd = apply(err_PR, 2, sd))
sum_err$index <- seq_len(nrow(sum_err))
sum_err$lambda <- full_grpreg$lambda_seq
sum_err$log_lambda <- log(full_grpreg$lambda_seq)
# Calculate number of groups with non-zero coefficients for each lambda
X <- PARAMS$dat[, PARAMS$x]
ngroups <- predict(full_PR, X = X, type = "ngroups")
# Prepare for plot
numcoef <- as.numeric(ngroups)
sum_err <- cbind(sum_err, numcoef)
min_log_lambda <- log(full_grpreg$best_lambda)
min_lambda <- full_grpreg$best_lambda
index <- which.min(sum_err$err.median)
plot.title <- paste0("CV Curve for Median Errors Across ", PARAMS$nreps," Repeats \n grpreg model")
ggplot(sum_err, aes(x = log_lambda, y = err.median)) + geom_point() +
geom_errorbar(aes(ymin = err.median - err.sd, ymax = err.median + err.sd)) +
geom_text(mapping = aes(x = log_lambda, y = max(err.median + err.sd + ((max(err.median) - min(err.median)) / 10)), label = numcoef)) +
geom_vline(aes(xintercept = min_log_lambda), linetype = "dashed") +
labs(title = plot.title) +
theme_bw(PARAMS$font_size) +
ylab("Deviance") +
xlab("log(Lambda)") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
In our example we see a nice dip followed by a steep rise in CV Error, a sign that there is helpful information provided by our model.
Lambda Trajectory plots give an overview of the strength of the association between each variable in the model and the outcome, and of how that association changes across the range of lambda. In this application, we are interested both in the overall association between each variable and the outcome and in the strength of each variable-by-treatment interaction, so we make 2 types of Lambda Trajectory plots. This is also useful for getting a sense of how important each variable is relative to the others.
Overall Effect - We take the Group Norm (sqrt of the sum of the squares) of the coefficient from all terms in each “group” divided by the sqrt of the number of terms in that group.
Interaction Terms - We take the Group Norm of only the interaction terms from each group divided by the sqrt of the number of terms in that group.
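As a worked example, the scaled group norm reduces to the following. This is a minimal sketch with made-up coefficients; `scaled_group_norm` is a hypothetical helper, not a function used elsewhere in this vignette.

```r
# Scaled group norm: sqrt of the sum of squared coefficients in a group,
# divided by sqrt of the number of terms in that group.
scaled_group_norm <- function(b) sqrt(sum(b^2)) / sqrt(length(b))

# Hypothetical spline group: two main-effect and two interaction terms
b_group <- c(0.30, -0.10, 0.20, 0.05)
overall  <- scaled_group_norm(b_group)       # Overall Effect norm
int_only <- scaled_group_norm(b_group[3:4])  # Interaction Terms norm
```

Dividing by sqrt(k) keeps groups of different sizes on a comparable scale.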
We demonstrate 2 options to visualize these plots: the standard trajectory plot and a faceted version. The faceted version may be particularly useful when there is strong overlap in the trajectories of many variables.
At times it might be useful to zoom in, in the event one variable distorts the scale of the y-axis. We illustrate this here as well.
# Make manual lambda trajectory plot in order to customize
# Create dataframe of lambda and group norms
lambda_group <- predict(full_PR, X = X, type = "norm")
lambda_group <- as.data.frame(lambda_group)
# Fix column names so that lambda values aren't in scientific notation
colnames(lambda_group) <- round(full_PR$lambda, 7)
# Set up data for plotting
# Create dataframe mapping coefficients for variables to different lambda values
lambda_group <- rownames_to_column(lambda_group, var = "Group")
lambda_group <- mutate(lambda_group, Biomarker = mod_var)
lambda_group <- filter(lambda_group, Group != 0)
# Calculate number of terms in each group to appropriately calculate Scaled Norms
main_rank <- data.frame(Biomarker = lambda_group$Biomarker,
k = c(rep(2, length(clin)),
length(Cat04)*2, length(Cat05)*2,
rep(4, length(cont_bm)), rep(6, length(Ind_bm))))
# Rank biomarkers by Scaled Norm at selected lambda
main_rank <- mutate(main_rank,
Selected_Norm = lambda_group[,as.character(round(min_lambda, 7))])
main_rank <- mutate(main_rank,
Selected_Norm_s = Selected_Norm/sqrt(k))
# Reversing order of rank so that largest is first
main_rank <- mutate(main_rank,
Rank = nrow(main_rank) - rank(Selected_Norm_s,
ties.method = "random"))
main_rank <- dplyr::select(main_rank, Biomarker, Rank, Selected_Norm_s)
# Transpose such that each row has coefficient for different variable/lambda combination
lambda_group_l <- pivot_longer(lambda_group, -c(Group, Biomarker),
names_to = "Lambda", values_to = "Norm")
lambda_group_l$Lambda <- as.numeric(lambda_group_l$Lambda)
lambda_group_l <- mutate(lambda_group_l, log_lambda = log(Lambda))
# Calculate number of terms in each group to appropriately calculate Scaled Norms
tmp <- data.frame(Biomarker = lambda_group$Biomarker,
k = c(rep(2, length(clin)),
length(Cat04)*2, length(Cat05)*2,
rep(4, length(cont_bm)), rep(6, length(Ind_bm))))
lambda_group_l <- inner_join(lambda_group_l, tmp)
lambda_group_l <- mutate(lambda_group_l, Scaled_Norm = Norm/sqrt(k))
lambda_group_l <- inner_join(lambda_group_l, main_rank)
# Make plot with Lambda on x-axis
ggplot(lambda_group_l, aes(x = Lambda, y = Scaled_Norm, color = Biomarker)) +
scale_shape_manual(values = seq(65, length.out = (length(unique(lambda_group_l$Biomarker))))) +
# guides(shape = F) +
# scale_x_reverse() +
geom_vline(xintercept = min_lambda, linetype = "dashed") +
geom_point(aes(shape = Biomarker), size = 3) +
ggtitle("Scaled Variables by 2*Standard Deviation",
subtitle = paste0("Selected Lambda = ", round(min_lambda, 4))) +
theme_bw(18)
# Zoomed In
ggplot(lambda_group_l, aes(x = Lambda, y = Scaled_Norm, color = Biomarker)) +
scale_shape_manual(values = seq(65, length.out = (length(unique(lambda_group_l$Biomarker))))) +
# guides(shape = F) +
# scale_x_reverse() +
geom_vline(xintercept = min_lambda, linetype = "dashed") +
geom_point(aes(shape = Biomarker), size = 3) +
coord_cartesian(ylim = c(0, 2.5)) +
ggtitle("Scaled Variables by 2*Standard Deviation: Zoomed",
subtitle = paste0("Selected Lambda = ", round(min_lambda, 4))) +
theme_bw(18)
# Faceted version
ggplot(lambda_group_l, aes(x = Lambda, y = Scaled_Norm)) +
scale_shape_manual(values = seq(65, length.out = length(unique(lambda_group_l$Biomarker)))) +
geom_point(data = lambda_group_l %>%
dplyr::select(-Biomarker), colour = "grey", alpha = 0.7) +
geom_point(aes(color = Biomarker), color = "black") +
facet_wrap(reorder(Biomarker, Rank) ~ ., nrow = 4, ncol = 5) +
geom_text(aes(x = Inf, y = Inf,
label = paste0("Selected Norm = ", round(Selected_Norm_s, 2))),
size = 2, hjust = 1, vjust = 1) +
theme_bw() +
theme(legend.position = "none") +
ggtitle("Scaled Variables by 2*Standard Deviation", subtitle = paste0("Selected Lambda = ", round(min_lambda, 4))) +
geom_vline(xintercept = min_lambda, linetype = "dashed")
# Faceted version Zoomed In
ggplot(lambda_group_l, aes(x = Lambda, y = Scaled_Norm)) +
scale_shape_manual(values = seq(65, length.out = length(unique(lambda_group_l$Biomarker)))) +
geom_point(data = lambda_group_l %>%
dplyr::select(-Biomarker), colour = "grey", alpha = 0.7) +
geom_point(aes(color = Biomarker), color = "black") +
facet_wrap(reorder(Biomarker, Rank) ~ ., nrow = 4, ncol = 5) +
geom_text(aes(x = Inf, y = Inf, label = paste0("Selected Norm = ", round(Selected_Norm_s, 2))),
size = 2, hjust = 1, vjust = 2) +
theme_bw() +
theme(legend.position = "none") +
ggtitle("Scaled Variables by 2*Standard Deviation: Zoomed", subtitle = paste0("Selected Lambda = ", round(min_lambda, 4))) +
geom_vline(xintercept = min_lambda, linetype = "dashed") +
coord_cartesian(ylim = c(0, 2.5))
# Set up data to make lambda trajectory plot for only Interaction coefficients
mult_cat <- c("Cat04", "Cat05")
cat_var <- c("cat", mult_cat)
# Create vector of all categorical variable names used in model (to be used later when making PE plots)
cat_mod <- c(clin, Cat04, Cat05)
# Get coefficients across all lambdas
coef_lambda <- predict(full_PR, X = X, type = "coefficients")
coef_lambda <- as.data.frame(coef_lambda)
# Fix column names so that lambda values aren't in scientific notation
colnames(coef_lambda) <- round(full_PR$lambda, 7)
coef_lambda <- rownames_to_column(coef_lambda, var = "Var")
coef_lambda <- mutate(coef_lambda, Type = ifelse(str_detect(Var, pattern = "INT"),
"Interaction", "Main_Effect"))
coef_lambda <- filter(coef_lambda, Var != "TRT2")
coef_lambda <- filter(coef_lambda, Var != "(Intercept)")
coef_lambda <- mutate(coef_lambda,
CAT = ifelse(str_detect(Var, pattern = paste(cat_var, collapse = "|")), 1, 0 ))
# Simplify variable names so that a single name covers all columns for a given variable
coef_lambda <- mutate(coef_lambda,
BM = ifelse(CAT == 0 & Type == "Main_Effect" & !str_detect(Var, pattern = "_Ind"),
str_sub(Var, 1, -2),
ifelse(CAT == 0 & Type == "Main_Effect" & str_detect(Var, pattern = "_Ind"),
str_sub(Var, 1, -5),
ifelse(CAT == 0 & Type == "Interaction" & !str_detect(Var, pattern = "_Ind_"),
str_sub(Var, 1, -9),
ifelse(CAT == 0 & Type == "Interaction" & str_detect(Var, pattern = "_Ind_"),
str_sub(Var, 1, -12),
ifelse(CAT == 1 & Type == "Interaction" & !str_detect(Var, pattern = paste(mult_cat, collapse = "|")),
str_sub(Var, 1, -8),
ifelse(CAT == 1 & Type == "Interaction" & str_detect(Var, pattern = paste(mult_cat, collapse = "|")),
str_sub(Var, 1, -10), Var)))))))
# Remove prefixes of continuous variables
coef_lambda <- mutate(coef_lambda, BM2 = ifelse(CAT == 0, str_sub(BM, 5), BM))
# Only keep interaction coefficients
coef_Int <- filter(coef_lambda, Type == "Interaction")
# Remove unnecessary columns
coef_Int <- dplyr::select(coef_Int, -CAT)
# Collapse Multilevel categorical variables into one name
coef_Int <- mutate(coef_Int, group = ifelse(str_detect(Var, pattern = "Cat04"), "Cat04",
ifelse(str_detect(Var, pattern = "Cat05"), "Cat05", BM2)))
# Calculate number of terms in each group for scaling purposes
coef_Int <- coef_Int %>% group_by(group) %>%
mutate(k = length(BM)) %>%
ungroup()
# Remove unnecessary columns
coef_Int <- dplyr::select(coef_Int, -BM, -BM2, -Type)
# Transpose data to have column for lambda
coef_Int_l <- pivot_longer(coef_Int, -c(Var, group, k), names_to = "Lambda", values_to = "coef")
coef_Int_l$Lambda <- as.numeric(coef_Int_l$Lambda)
coef_Int_l <- mutate(coef_Int_l, log_lambda = log(Lambda))
# Calculate scaled norms
coef_Int_l <- coef_Int_l %>%
group_by(Lambda, group) %>%
mutate(Scaled_Norm = sqrt(sum(coef^2)/k)) %>%
distinct(group, .keep_all = T) %>% # Remove duplicate rows of variable within a group
arrange(Lambda, group, Var) %>%
ungroup()
# Rename group to more intuitive name
coef_Int_l <- rename(coef_Int_l, Biomarker = group)
# Rank biomarkers by Scaled Norm at selected lambda
Int_rank <- filter(coef_Int_l, Lambda == round(min_lambda, 7))
# Reversing order of rank so that largest is first
Int_rank <- mutate(Int_rank,
Rank = nrow(Int_rank) - rank(Scaled_Norm, ties.method = "random"))
Int_rank <- mutate(Int_rank, Selected_Norm = Scaled_Norm)
Int_rank <- dplyr::select(Int_rank, Biomarker, Rank, Selected_Norm)
coef_Int_l <- inner_join(coef_Int_l, Int_rank)
# Make plot with Lambda on x-axis
ggplot(coef_Int_l, aes(x = Lambda, y = Scaled_Norm, color = Biomarker)) +
scale_shape_manual(values = seq(65, length.out = (length(unique(coef_Int_l$Biomarker))))) +
# guides(shape = F) +
# scale_x_reverse() +
geom_vline(xintercept = min_lambda, linetype = "dashed") +
geom_point(aes(shape = Biomarker), size = 3) +
ggtitle("Interaction Coefficients", subtitle = paste0("Selected Lambda = ", round(min_lambda, 4))) +
theme_bw(18)
# Zoomed In
ggplot(coef_Int_l, aes(x = Lambda, y = Scaled_Norm, color = Biomarker)) +
scale_shape_manual(values = seq(65, length.out = (length(unique(coef_Int_l$Biomarker))))) +
# guides(shape = F) +
# scale_x_reverse() +
geom_vline(xintercept = min_lambda, linetype = "dashed") +
geom_point(aes(shape = Biomarker), size = 3) +
coord_cartesian(ylim = c(0, 1.5)) +
ggtitle("Interaction Coefficients: Zoomed", subtitle = paste0("Selected Lambda = ", round(min_lambda, 4))) +
theme_bw(18)
# Faceted version
ggplot(coef_Int_l, aes(x = Lambda, y = Scaled_Norm)) +
scale_shape_manual(values = seq(65, length.out = length(unique(coef_Int_l$Biomarker)))) +
geom_point(data = coef_Int_l %>%
dplyr::select(-Biomarker), colour = "grey", alpha = 0.7) +
geom_point(aes(color = Biomarker), color = "black") +
facet_wrap(reorder(Biomarker, Rank) ~ ., nrow = 4, ncol = 5) +
geom_text(aes(x = Inf, y = Inf, label = paste0("Selected Norm = ", round(Selected_Norm, 2))),
size = 2, hjust = 1, vjust = 1) +
theme_bw() +
theme(legend.position = "none") +
ggtitle("Interaction Coefficients", subtitle = paste0("Selected Lambda = ", round(min_lambda, 4))) +
geom_vline(xintercept = min_lambda, linetype = "dashed")
# Faceted version Zoomed
ggplot(coef_Int_l, aes(x = Lambda, y = Scaled_Norm)) +
scale_shape_manual(values = seq(65, length.out = length(unique(coef_Int_l$Biomarker)))) +
geom_point(data = coef_Int_l %>%
dplyr::select(-Biomarker), colour = "grey", alpha = 0.7) +
geom_point(aes(color = Biomarker), color = "black") +
facet_wrap(reorder(Biomarker, Rank) ~ ., nrow = 4, ncol = 5) +
geom_text(aes(x = Inf, y = Inf, label = paste0("Selected Norm = ", round(Selected_Norm, 2))),
size = 2, hjust = 1, vjust = 2) +
theme_bw() +
theme(legend.position = "none") +
ggtitle("Interaction Coefficients: Zoomed",
subtitle = paste0("Selected Lambda = ", round(min_lambda, 4))) +
geom_vline(xintercept = min_lambda, linetype = "dashed") +
coord_cartesian(ylim = c(0, 1.5))
In conjunction with seeing the strength of the association between each variable and the outcome, it is useful to see the shape of that association. Because we use splines, we cannot rely on conventional forest plots for continuous variables, as those assume a constant effect across all values of the variable of interest.
The associations for categorical variables, on the other hand, can be summarized by the more familiar forest plots. We also demonstrate a scatter plot which provides similar information.
# Partial Effects Plots
# Get prediction matrix for selected lambda to use in PE plots
coef <- predict(full_PR, X = X, type = "coefficients", which = index)
coef <- rownames_to_column(as.data.frame(coef), var = "Var")
coef <- rename(coef, coef = as.character(round(full_grpreg$best_lambda, 4))) # The selected lambda
coef <- mutate(coef, Type = ifelse(str_detect(Var, pattern = "INT"), "Interaction", "Main_Effect"))
coef <- filter(coef, Var != "TRT2")
coef <- filter(coef, Var != "(Intercept)")
coef <- mutate(coef, CAT = ifelse(str_detect(Var, pattern = paste(cat_var, collapse = "|")), 1, 0 ))
# Simplify variable names so that a single name covers all columns for a given variable
coef <- mutate(coef, BM = ifelse(CAT == 0 & Type == "Main_Effect" &
!str_detect(Var, pattern = "_Ind"), str_sub(Var, 1, -2),
ifelse(CAT == 0 & Type == "Main_Effect" &
str_detect(Var, pattern = "_Ind"), str_sub(Var, 1, -5),
ifelse(CAT == 0 & Type == "Interaction" &
!str_detect(Var, pattern = "_Ind_"),
str_sub(Var, 1, -9),
ifelse(CAT == 0 & Type == "Interaction" &
str_detect(Var, pattern = "_Ind_"),
str_sub(Var, 1, -12),
ifelse(CAT == 1 & Type == "Interaction" &
!str_detect(Var, pattern = paste(mult_cat, collapse = "|")),
str_sub(Var, 1, -8),
ifelse(CAT == 1 & Type == "Interaction" &
str_detect(Var, pattern = paste(mult_cat, collapse = "|")),
str_sub(Var, 1, -10), Var)))))))
# Remove variables which were filtered out
coef <- filter(coef, coef != 0)
###########################
# Categorical
# Set up data for scatter plot of categorical coefficients
coef_CAT <- filter(coef, CAT == 1) %>% dplyr::select(-CAT)
coef_CAT <- mutate(coef_CAT, group = ifelse(str_detect(Var, pattern = "Cat04"), "Cat04",
ifelse(str_detect(Var, pattern = "Cat05"), "Cat05", BM)))
# Calculate scaled norms
coef_CAT <- coef_CAT %>% group_by(group, Type) %>%
mutate(k = length(coef)) %>%
mutate(Norm = sqrt(sum(coef^2)/k)) %>%
ungroup()
# Transpose to calculate contrasts
coef_cat_w <- pivot_wider(coef_CAT, id_cols = c(BM, group), names_from = Type, values_from = coef)
# Calculate contrasts (exponentiating a log-odds contrast gives an odds ratio)
if(simon == "Yes"){
coef_cat_w <- mutate(coef_cat_w, Placebo = Main_Effect - Interaction)
coef_cat_w <- mutate(coef_cat_w, Treatment = Main_Effect + Interaction)
coef_cat_w <- mutate(coef_cat_w, Placebo_OR = exp(Placebo))
coef_cat_w <- mutate(coef_cat_w, Treatment_OR = exp(Treatment))
} else if(simon == "No") {
coef_cat_w <- mutate(coef_cat_w, Placebo = Main_Effect)
coef_cat_w <- mutate(coef_cat_w, Treatment = Main_Effect + Interaction)
coef_cat_w <- mutate(coef_cat_w, Placebo_OR = exp(Placebo))
coef_cat_w <- mutate(coef_cat_w, Treatment_OR = exp(Treatment))
}
ggplot(coef_cat_w, aes(x = Placebo, y = Treatment, shape = BM)) +
# facet_grid(Model ~ Test) +
# geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "grey") +
# annotate("text", x = -0.5, y = -log10(0.05) + 0.2, label = "p = 0.05", color = "grey") +
# annotate("text", angle = 270, x = -log10(0.05) + 0.2, y = 3.5, label = "p = 0.05", color = "grey") +
# geom_vline(xintercept = -log10(0.05), linetype = "dashed", color = "grey") +
geom_hline(yintercept = 0, linetype = "dashed", color = "grey") +
geom_vline(xintercept = 0, linetype = "dashed", color = "grey") +
geom_point(size = 3, stroke = 2) +
scale_shape_manual(values=c(65:(64+length(unique(coef_cat_w$BM))))) +
ggtitle("Predicted log(Odds Ratio) for Being Primary Resistant") +
ylim(-0.4, 0.8) +
xlim(-0.4, 0.8) +
coord_fixed() +
geom_abline(color = "grey") +
theme_bw(18)
## Warning: Removed 1 rows containing missing values (`geom_point()`).
coef_groups <- predict(full_PR, X = X, type = "norm", which = index)
names(coef_groups) <- mod_var
coef_groups <- rownames_to_column(as.data.frame(coef_groups), var = "Var")
# Set up data for Table
coef_cat_table <- dplyr::select(coef_CAT, group, Type, k, Norm)
coef_cat_table <- unique(coef_cat_table)
coef_cat_table <- pivot_wider(coef_cat_table, id_cols = c(group, k), names_from = Type, values_from = Norm)
coef_cat_table <- inner_join(coef_cat_table, coef_groups, by = c("group" = "Var"))
coef_cat_table <- rename(coef_cat_table, Group = coef_groups)
coef_cat_table <- mutate(coef_cat_table, Group = Group/sqrt(k))
# Make dot plot of OR for categorical variables
coef_cat_w <- pivot_longer(coef_cat_w, c(Placebo, Treatment), names_to = "Treatment",
values_to = "log_OR")
coef_cat_w <- mutate(coef_cat_w, log2_OR = log_OR*log2(exp(1)))
ggplot(coef_cat_w, aes(y = BM, x = log2_OR, color = Treatment, shape = Treatment)) +
geom_point(size = 5) +
scale_shape_discrete(labels=c("Treatment", "Placebo")) +
scale_linetype_discrete(labels=c("Treatment", "Placebo")) +
scale_color_manual(values = c("red", "blue"), labels = c("Treatment", "Placebo")) +
geom_vline(xintercept = 0, linetype = "dashed") +
ylab("Biomarker") +
xlab("Log2 OR") +
theme_bw(18) +
ggtitle("Log2 OR for Having Outcome")
Partial Effects (PE) plots are particularly important when continuous variables enter the model through splines, since the relationship can no longer be summarized by the sign and slope of a single coefficient. With no linearity assumption, we must visualize the whole relationship through PE plots.
In order to make PE plots, we have to make predictions across a range of values for each biomarker under each treatment while keeping everything else constant. In general we do not recommend predicting over the entire range of observed values because outliers will distort the x-axis. In this case study we make predictions for the middle 90% of the observed data.
When working with splines, it is important to apply the knots at the same locations as they were in the training model.
In the plot below we combine both continuous and categorical variables onto a single plot. We found that it can be confusing for some audiences to visualize the categorical and continuous variables on different figures and therefore this was a way to avoid that confusion. See the grpLasso Vignette with Survival Outcome (https://github.com/apfela2/Vignettes/Vignette_grpLasso_Survival) for an example of a PE plot with only continuous variables.
NOTE: One very important bug to be wary of when making predictions from grpreg models is that it assumes the variables in the newdata are in the same order as they were in the training data. If the order changes you will get wrong predictions WITHOUT WARNING!
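One defensive way to guard against this pitfall is to explicitly reorder the prediction columns to match the training matrix before calling predict(). Below is a sketch; `align_columns` is a hypothetical helper, not part of grpreg.

```r
# Sketch: force newdata columns into the training-matrix order and fail
# loudly if any training column is missing, rather than silently getting
# wrong predictions from grpreg.
align_columns <- function(new_X, train_cols) {
  stopifnot(all(train_cols %in% colnames(new_X)))
  new_X[, train_cols, drop = FALSE]
}
# e.g. pred_X <- align_columns(pred_dat, colnames(X))
```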
# Continuous
# Set up data for scatterplot of continuous coefficients
coef_cont <- filter(coef, CAT == 0) %>% dplyr::select(-CAT)
coef_cont <- mutate(coef_cont, BM = str_sub(BM, 5, -1))
# Calculate Scaled Norms
coef_cont <- coef_cont %>% group_by(BM, Type) %>%
mutate(k = length(coef)) %>%
mutate(Norm = sqrt(sum(coef^2)/k)) %>%
ungroup()
# Create Table
cont_coef_table <- coef_cont
cont_coef_table <- dplyr::select(cont_coef_table, Var, BM, coef, Type)
cont_coef_table <- pivot_wider(cont_coef_table, id_cols = c(Var, BM),
names_from = Type, values_from = coef)
cont_coef_table <- mutate(cont_coef_table, Main_Effect = round(Main_Effect, 3))
cont_coef_table <- mutate(cont_coef_table, Interaction = round(Interaction, 3))
# tmp <- dplyr::select(cont_coef_table, Interaction)
coef_cont <- dplyr::select(coef_cont, BM, Type, Norm, k)
coef_cont <- unique(coef_cont)
# Set up for PE plot
coef_cont_w <- pivot_wider(coef_cont, id_cols = c(BM, k), names_from = Type, values_from = Norm)
coef_cont_w <- inner_join(coef_cont_w, coef_groups, by = c("BM" = "Var"))
coef_cont_w <- rename(coef_cont_w, Group = coef_groups)
coef_cont_w <- mutate(coef_cont_w, Group = Group/sqrt(k))
##########################################
# Set up data for partial effects plots
##########################################
# Identify knot locations for each continuous covariate
# cont_dat_s1 <- select(cont_dat, str_subset(colnames(cont_dat), "s1"))
# Sanity checks
attributes(cont_dat$mod_Cont01_s1)$parms
## [1] -0.3851069 -0.1084560 0.4320535
attributes(cont_dat$Ind_Cont09_s1)$parms
## [1] -0.2459660 -0.2212533 0.4954145
# Interpolate, and apply model knot locations to create splines
cont_raw <- dplyr::select(imp_dat, all_of(c(cont_bm, Ind_bm)))
meds <- sapply(cont_raw, median)
nintp <- 200
pred_dat <- list()
pred_dat_names <- list()
# Make same names as in model data
for(i in seq_len(length(cont_bm))){
pred_dat_names[[i]] <- c(paste0("mod_", cont_bm[i], "1"), paste0("mod_", cont_bm[i], "2"))
}
# Start with only cont versions of Cont09 and we'll add Ind versions later
for(i in seq_len(length(Ind_bm))){
pred_dat_names[[i + length(cont_bm)]] <- c(paste0("Ind_", Ind_bm[i], "1"),
paste0("Ind_", Ind_bm[i], "2"))
}
pred_dat_names <- unlist(pred_dat_names)
# Create vector of ALL cont bm
all_bm <- c(cont_bm, Ind_bm)
# For the 1st BM we create 2 sequences (one per ARM) from the 5th to 95th percentile, followed by the median for the remaining BMs
# For the 2nd BM onward we start with the median for each preceding BM, followed by the sequence
for(i in seq_len(length(cont_bm))){
if(i == 1) {
pred_dat[[cont_bm[i]]] <- c(rep(seq(quantile(imp_dat[[cont_bm[i]]], 0.05),
quantile(imp_dat[[cont_bm[i]]], 0.95),
length.out = nintp), 2),
rep(meds[i], (length(all_bm)-i)*nintp*2))
} else {
pred_dat[[cont_bm[i]]] <- c(rep(meds[i], (i-1)*nintp*2),
rep(seq(quantile(imp_dat[[cont_bm[i]]], 0.05),
quantile(imp_dat[[cont_bm[i]]], 0.95),
length.out = nintp), 2), rep(meds[i],
(length(all_bm)-i)*nintp*2))
}
# Create 2 basis functions from interpolated data using same knots as in model
# NOTE: This code assumes there were only 3 knots in model for all continuous variables
pred_dat[[paste0("mod_", cont_bm[i], "1")]] <- rcs(pred_dat[[cont_bm[[i]]]],
attributes(cont_dat[[paste0("mod_", cont_bm[i], "1")]])$parms,
name = pred_dat_names[(2*i) - 1])[,1]
attr(pred_dat[[paste0("mod_", cont_bm[i], "1")]], "dimnames") <- NULL
pred_dat[[paste0("mod_", cont_bm[i], "2")]] <- rcs(pred_dat[[cont_bm[[i]]]],
attributes(cont_dat[[paste0("mod_", cont_bm[i], "2")]])$parms,
name = pred_dat_names[2*i])[,2]
attr(pred_dat[[paste0("mod_", cont_bm[i], "2")]], "dimnames") <- NULL
}
# For Cont09 (Ind_bm), apply the 2nd-BM-or-later logic from the previous chunk
for(i in seq_len(length(Ind_bm))){
j <- length(cont_bm) + i # Offset Ind_bm indices to follow cont_bm
pred_dat[[Ind_bm[i]]] <- c(rep(meds[j], (j - 1)*nintp*2),
rep(seq(quantile(imp_dat[[Ind_bm[i]]], 0.05),
quantile(imp_dat[[Ind_bm[i]]], 0.95),
length.out = nintp), 2),
rep(meds[j], (length(all_bm)-j)*nintp*2))
pred_dat[[paste0("Ind_", Ind_bm[i], "1")]] <- rcs(pred_dat[[Ind_bm[[i]]]],
attributes(cont_dat[[paste0("Ind_", Ind_bm[i], "1")]])$parms,
name = pred_dat_names[(2*j) - 1])[,1]
attr(pred_dat[[paste0("Ind_", Ind_bm[i], "1")]], "dimnames") <- NULL
pred_dat[[paste0("Ind_", Ind_bm[i], "2")]] <- rcs(pred_dat[[Ind_bm[[i]]]],
attributes(cont_dat[[paste0("Ind_", Ind_bm[i], "2")]])$parms,
name = pred_dat_names[2*j])[,2]
attr(pred_dat[[paste0("Ind_", Ind_bm[i], "2")]], "dimnames") <- NULL
}
pred_dat <- as.data.frame(pred_dat)
# Create Indicator functions for Cont09
pred_dat <- mutate(pred_dat, Ind_Cont09_s_Ind = ifelse(Cont09_s == min(imp_dat$Cont09_s), 1, 0))
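The indicator above implements the discontinuity handling used for Cont09: observations piled up at the minimum value (e.g. a detection limit) get their own term, while the spline models the remaining range. A minimal sketch with simulated data:

```r
set.seed(3)
x   <- c(rep(0, 30), runif(70, 1, 10))  # mass at the minimum (detection limit)
ind <- as.numeric(x == min(x))          # indicator for the point of discontinuity
# A model then includes both `ind` and the spline basis of `x`,
# letting the discontinuity point shift off the spline's smooth trend
table(ind)
```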
# Keep only terms used in model
pred_dat <- dplyr::select(pred_dat, all_of(str_subset(colnames(pred_dat), "mod")),
all_of(str_subset(colnames(pred_dat), "Ind")))
# Create TRT2
if(simon == "Yes") {
pred_dat$TRT2 <- rep(c(1, -1), each = nintp, times = length(all_bm))
} else if(simon == "No"){
pred_dat$TRT2 <- rep(c(1, 0), each = nintp, times = length(all_bm))
}
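The two codings above differ in interpretation, not fit: with 0/1 dummy coding the biomarker main effect is the slope in the reference arm, while with +1/-1 coding it is the average slope across arms (and the interaction coefficient is halved). A simulated sketch with an ordinary linear model:

```r
set.seed(2)
n   <- 200
trt <- rep(c(0, 1), each = n / 2)
bm  <- rnorm(n)
y   <- 0.5 * bm + 1.0 * bm * trt + rnorm(n)
fit01 <- lm(y ~ bm * I(trt))            # 0/1 dummy coding
fitpm <- lm(y ~ bm * I(2 * trt - 1))    # +1/-1 coding
# Same column space, so identical fitted values; only the
# coefficient interpretation changes between the two codings
all.equal(fitted(fit01), fitted(fitpm))
```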
# Create categorical variables set to reference levels
for(var in cat_mod){
pred_dat[[var]] <- 0
}
# Create interactions for all variables
pred_dat <- mutate_at(pred_dat, bm, list(modINT = ~. * TRT2))
pred_dat <- mutate_at(pred_dat, Cat04, list(Cat04INT = ~. * TRT2))
pred_dat <- mutate_at(pred_dat, Cat05, list(Cat05INT = ~. * TRT2))
pred_dat <- mutate_at(pred_dat, clin, list(catINT = ~. * TRT2))
pred_dat <- mutate_at(pred_dat, Ind, list(IndINT = ~. * TRT2))
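`mutate_at()` is superseded in current dplyr; the same interaction columns can be built with `across()`. A sketch for the first call above (assuming `bm` holds the biomarker column names; the `.names` spec reproduces the `_modINT` suffix that the named-list form appends):

```r
pred_dat <- mutate(pred_dat,
                   across(all_of(bm), ~ .x * TRT2, .names = "{.col}_modINT"))
```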
#################################
# Get predictions
###################################
# !!!! BUG IN predict.grpreg !!!!
# The prediction matrix must have its columns IN THE SAME ORDER as the training data! If not, results will be INCORRECT WITH NO WARNING!!!
# Ensure prediction matrix has columns in the same order as the training data
pred_dat <- dplyr::select(pred_dat, all_of(PARAMS$x))
pred_mat <- as.matrix(pred_dat)
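Given the silent column-order sensitivity noted above, a cheap defensive check can fail fast rather than return wrong predictions (this assumes, as in the `select()` above, that `PARAMS$x` stores the training column order):

```r
# predict() for grpreg-style fits matches columns by position, not by name,
# so verify the prediction matrix aligns with the training order first
stopifnot(identical(colnames(pred_mat), PARAMS$x))
```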
# Get Predictions
pred_dat$preds <- predict(full_PR, X = pred_mat, type = "link", which = index)
# Identify which BM is getting predicted
pred_dat$BM <- rep(all_bm, each = 2*nintp)
#################################
# Make partial effects plots
##################################
raw_bm <- sub("_s$", "", all_bm)
plot_dat <- list()
# Create dataframe of variables on original scale, but mapping predictions
for(i in seq_len(length(raw_bm))) {
plot_dat[[raw_bm[i]]] <- rep(seq(quantile(red_dat[[raw_bm[i]]], 0.05, na.rm = T),
quantile(red_dat[[raw_bm[i]]], 0.95, na.rm = T),
length.out = nintp), 2)
}
plot_dat <- as.data.frame(plot_dat)
plot_dat$TRT <- rep(c("Treatment", "Placebo"), each = nintp)
plot_dat <- pivot_longer(plot_dat, -TRT, names_to = "Biomarker", values_to = "Value")
# Rearrange in order to match up order with pred_dat
plot_dat <- arrange(plot_dat, match(Biomarker, raw_bm), TRT, Value)
# Since everything matches, we transfer preds
plot_dat$preds <- pred_dat$preds
plot_dat <- mutate(plot_dat, log2_preds = preds*log2(exp(1)))
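The factor `log2(exp(1))` converts natural-log predictions to the log2 scale, since log2(x) = ln(x) * log2(e); for example, an odds ratio of 2 maps to exactly 1:

```r
lor <- log(2)          # natural-log odds ratio for a doubling of odds
lor * log2(exp(1))     # equals 1 on the log2 scale
```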
# Identify filtered BM
filtered <- mod_var[!(mod_var %in% mod_var[predict(full_PR, X = X, type = "groups", which = index)])]
filtered <- sub("_s$", "", filtered)
# Filter out biomarkers removed
plot_dat_filtered <- filter(plot_dat, !(Biomarker %in% filtered))
# Separate out point of discontinuity for better visualization
nuggets <- filter(plot_dat_filtered, Biomarker == "Cont09" & Value == min(red_dat$Cont09, na.rm = T))
nuggets <- mutate(nuggets, Value2 = Value - 1)
no_nugget <- filter(plot_dat_filtered, (Biomarker == "Cont09" &
Value != min(red_dat$Cont09, na.rm = T)) | Biomarker != "Cont09")
# Create dataframe for categorical variables with corresponding columns
coef_cat_w <- mutate(coef_cat_w,
BM2 = ifelse(str_detect(BM, pattern = "cat|TypeB"), 1,
ifelse(str_detect(BM, pattern = "TypeC"), 2, 3)))
cat_plot_dat <- data.frame(TRT = coef_cat_w$Treatment, Biomarker = coef_cat_w$group,
Value = coef_cat_w$BM2, preds = coef_cat_w$log_OR,
log2_preds = coef_cat_w$log2_OR)
# Combine with the no_nugget dataframe
combined_plot_dat <- rbind(no_nugget, cat_plot_dat)
# Set up ranking to combine
Int_rank2 <- Int_rank
Int_rank2 <- mutate(Int_rank2, Biomarker = sub("_s$", "", Biomarker))
combined_plot_dat <- inner_join(combined_plot_dat, Int_rank2)
combined_plot_dat$Biomarker <- reorder(combined_plot_dat$Biomarker, combined_plot_dat$Rank)
nuggets <- inner_join(nuggets, Int_rank2)
nuggets$Biomarker <- reorder(nuggets$Biomarker, nuggets$Rank)
cat_plot_dat <- inner_join(cat_plot_dat, Int_rank2)
cat_plot_dat$Biomarker <- reorder(cat_plot_dat$Biomarker, cat_plot_dat$Rank)
all_bm2 <- sub("_s$", "", all_bm)
tmp <- ggplot(combined_plot_dat, aes(x = Value, y = log2_preds, color = TRT, linetype = TRT)) +
geom_line(data = subset(combined_plot_dat, Biomarker %in% all_bm2)) +
geom_line(data = subset(combined_plot_dat, Biomarker == "Cont09")) + # To get around bug where some of Cont09 is not plotted
facet_wrap(~Biomarker, scales = "free_x") +
ylab("Log2 Relative Odds") +
xlab("Normalized Predictor Value") +
scale_shape_discrete(labels = c("Treatment", "Placebo")) +
scale_linetype_discrete(labels = c("Treatment", "Placebo")) +
scale_color_manual(values = c("red", "blue"), labels = c("Treatment", "Placebo")) +
theme_bw()
tmp +
geom_point(data=nuggets,
aes(x=Value2,y=log2_preds, color = TRT, shape = TRT), size = 3) +
geom_point(data = cat_plot_dat,
aes(x = Value, y = log2_preds, color = TRT), inherit.aes = F) +
geom_hline(yintercept = 0, linetype = "dashed")
Time required to process this report: 1.547321 mins
R session information:
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.6 LTS
##
## Matrix products: default
## BLAS: /opt/tbio/domino_202310/binaries/R-4.3.1/lib/R/lib/libRblas.so
## LAPACK: /opt/tbio/domino_202310/binaries/R-4.3.1/lib/R/lib/libRlapack.so; LAPACK version 3.11.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] lubridate_1.9.3 forcats_1.0.0 purrr_1.0.2 readr_2.1.4
## [5] tidyverse_2.0.0 grpreg_3.4.0 rmsb_1.0-0 rms_6.7-1
## [9] Hmisc_5.1-1 corrplot_0.92 bmsPURR_0.1.34 tidyr_1.3.0
## [13] tibble_3.2.1 stringr_1.5.0 ggplot2_3.4.3 dplyr_1.1.3
## [17] conflicted_1.2.0
##
## loaded via a namespace (and not attached):
## [1] rstudioapi_0.15.0 jsonlite_1.8.7 CatMisc_1.1.5
## [4] magrittr_2.0.3 TH.data_1.1-2 farver_2.1.1
## [7] rmarkdown_2.25 fs_1.6.3 vctrs_0.6.3
## [10] memoise_2.0.1 askpass_1.2.0 base64enc_0.1-3
## [13] htmltools_0.5.6 usethis_2.2.2 polspline_1.1.23
## [16] Formula_1.2-5 sass_0.4.7 StanHeaders_2.26.28
## [19] bslib_0.5.1 htmlwidgets_1.6.2 desc_1.4.2
## [22] sandwich_3.0-2 zoo_1.8-12 cachem_1.0.8
## [25] DominoR_0.0.2 mime_0.12 lifecycle_1.0.3
## [28] pkgconfig_2.0.3 Matrix_1.6-1.1 R6_2.5.1
## [31] fastmap_1.1.1 shiny_1.7.5 digest_0.6.33
## [34] pingr_2.0.2 colorspace_2.1-0 ps_1.7.5
## [37] rprojroot_2.0.3 pkgload_1.3.3 labeling_0.4.3
## [40] timechange_0.2.0 fansi_1.0.4 httr_1.4.7
## [43] compiler_4.3.1 remotes_2.4.2.1 withr_2.5.1
## [46] htmlTable_2.4.1 backports_1.4.1 inline_0.3.19
## [49] QuickJSR_1.0.6 pkgbuild_1.4.2 MASS_7.3-60
## [52] quantreg_5.97 openssl_2.1.1 sessioninfo_1.2.2
## [55] myRepository_1.34.0 loo_2.6.0 tools_4.3.1
## [58] foreign_0.8-85 httpuv_1.6.11 nnet_7.3-19
## [61] glue_1.6.2 callr_3.7.3 nlme_3.1-163
## [64] promises_1.2.1 grid_4.3.1 checkmate_2.2.0
## [67] getPass_0.2-2 cluster_2.1.4 generics_0.1.3
## [70] ParamSetI_1.17.0 gtable_0.3.4 EventLogger_1.15.3
## [73] tzdb_0.4.0 data.table_1.14.8 hms_1.1.3
## [76] utf8_1.2.3 pillar_1.9.0 later_1.3.1
## [79] splines_4.3.1 dynamictable_1.5 lattice_0.21-9
## [82] survival_3.5-7 SparseM_1.81 tidyselect_1.2.0
## [85] miniUI_0.1.1.1 knitr_1.44 gridExtra_2.3
## [88] stats4_4.3.1 xfun_0.40 devtools_2.4.5
## [91] matrixStats_1.0.0 rstan_2.26.23 stringi_1.7.12
## [94] yaml_2.3.7 evaluate_0.22 codetools_0.2-19
## [97] cli_3.6.1 RcppParallel_5.1.7 rpart_4.1.19
## [100] xtable_1.8-4 munsell_0.5.0 processx_3.8.2
## [103] jquerylib_0.1.4 Rcpp_1.0.11 MatrixModels_0.5-2
## [106] ellipsis_0.3.2 prettyunits_1.2.0 profvis_0.3.8
## [109] urlchecker_1.0.1 mvtnorm_1.2-3 scales_1.2.1
## [112] crayon_1.5.2 rlang_1.1.1 multcomp_1.4-25
Domino User: apfela